Nucleation phenomena in protein folding: The modulating role of protein sequence 
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For the vast majority of naturally occurring, small, single domain proteins folding 
is often described as a two-state process that lacks detectable intermediates. This 
observation has often been rationalized on the basis of a nucleation mechanism for 
protein folding whose basic premise is the idea that after completion of a specific set 
of contacts forming the so-called folding nucleus the native state is achieved promptly. 
Here we propose a methodology to identify folding nuclei in small lattice polymers and 
apply it to the study of protein molecules with chain length N=48. To investigate the 
extent to which protein topology is a robust determinant of the nucleation mechanism 
we compare the nucleation scenario of a native-centric model with that of a sequence 
specific model sharing the same native fold. To evaluate the impact of the sequence's 
finner details in the nucleation mechanism we consider the folding of two non- homol- 
ogous sequences. We conclude that in a sequence-specific model the folding nucleus is, 
to some extent, formed by the most stable contacts in the protein and that the less 
stable linkages in the folding nucleus are solely determined by the fold's topology. We 
have also found that independently of protein sequence the folding nucleus performs 
the same 'topological' function. This unifying feature of the nucleation mechanism re- 
sults from the residues forming the folding nucleus being distributed along the protein 
chain in a similar and well-defined manner that is determined by the fold's topological 
features. 
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INTRODUCTION 

Proteins do not appear to fold by means of a unique 
mechanism and over the years several phenomenologi- 
cal models have been proposed for protein folding |l|, 
i, H i, i, i, 0, B S, [13, EH- The framework model, 
for example, is based on the idea that the formation of 
the hydrogen-bonded secondary structural elements pre- 
cedes the formation of tertiary structure [l], Q , and the 
diffusion-collision model assumes that part of the protein 
folding process involves the interaction of metastable re- 
gions of structure which, when in contact, may provide 
additional stabilization 



Chymotrypsin inhibitor 2, a small, single domain, 
two-state folder with 64 residues, epitomizes the so- 
called nucleation-condensation (NC) mechanism for pro- 
tein folding. The latter was firstly investigated by 
Shakhnovich, in the context of Monte Carlo lattice simu- 
lations 0, HI, and by Fersht through extensive protein 
engineering studies [6j termed </>- value analysis. The 
NC mechanism can be viewed as a modified version of 
the nucleation-growth mechanism originally proposed by 
Wetlaufer The basic premise of the NC model is the 
idea that once a specific set of contacts, named the folding 
nucleus (FN), forms there is a concerted consolidation of 
secondary and tertiary interactions as the whole protein 
rapidly collapses to the native fold. 

More recently, the topomer search model, which em- 
phasizes native state's topology as a major determinant 



of protein folding rates has been proposed Q and in- 
vestigated in the context of off-lattice Langevin simula- 
tions 12, [T3 |. While it seems well established that the 
native topology, as measured by the contact order pa- 
rameter 14 1, and other related quantities 15, 3, 17 1 , is 
a major determinant of two-state protein folding kinetics, 
the question of understanding the relative roles played by 
native structure [TU and protein sequence in deter- 
mining the folding mechanism remains to be elucidated 
(reviewed in [20j|). 

In their seminal work Q , Abkevich and coworkers have 
found that native structure is a more robust determinant 
of the folding mechanism than sequence for 36-mer lat- 
tice proteins. Indeed, the results of Monte Carlo simu- 
lations reported by Abkevich and coworkers [3J suggest 
that three non-homologous sequences sharing the same 
native fold also share a common FN. Here we use this 
result as the starting point of a study that is based on a 
novel methodology and on rather extensive statistics. A 
nucleation pattern driven exclusively by native structure 
(and therefore by native topology) is compared with pat- 
terns driven by the combined effects of protein structure 
and sequence. If the FN is determined by native struc- 
ture alone the nucleation patterns of different sequences, 
with the same native fold, should be similar and in addi- 
tion, they should be similar to the nucleation pattern of 
a model whose folding dynamics is driven strictly by the 
structural features of the native fold. 

This paper is organized as follows. The next section 
describes the models used and computational mcthodolo- 
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gies adopted. We then propose a new strategy to identify 
folding nuclei and present and discuss the simulation re- 
sults obtained based on it for three different model pro- 
teins. Finally we draw some conclusions and compare 
our results with those obtained using other strategies and 
simulation efforts. 



MODELS AND METHODS 
Lattice model and simulation details 

We consider a simple three-dimensional lattice model 
of a protein molecule with chain length N — 48. In such 
a minimalist model amino acid residues, represented by 
beads of uniform size, occupy the lattice vertices. The 
peptide bond that covalently connects amino acids along 
the polypeptide chain, is represented by sticks with uni- 
form (unit) length corresponding to the lattice spacing 

(Fig a top). 

In order to mimic the protein's relaxation towards the 
native state we use a standard Monte Carlo (MC) al- 
gorithm 21 1 together with the kink-jump move set [22I ]. 
Local random displacements of one or two beads (at the 
same time) are repeatedly accepted or rejected in accor- 
dance with the standard Metropolis MC rule (2l|. A MC 
simulation starts from a randomly generated unfolded 
conformation and the folding dynamics is monitored by 
following the evolution of the fraction of native contacts, 
Q = q/L, where L = 57 is the number of contacts in 
the native fold and q is the number of native contacts 
formed at each MC step. The number of MC steps re- 
quired to fold to the native state (i.e., to Q = 1.0) is the 
first passage time (FPT). The native conformation used 
in this study together with its contact map representa- 
tion is shown in Figure [TJ 

Unless otherwise specified, folding is studied at the so- 
called optimal folding temperature, T optj the temper a- 
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ture that minimizes the folding time, t, 
[27l | which is computed as the mean first passage time 
(MFPT) of 100 simulations. This optimal folding tem- 
perature may differ from the folding transition tempera- 
ture, Tf, at which the probability for finding the protein 
in an unfolded state is the same as the probability for 
finding it in the native state. In the context of a lattice 
model Tf may be defined as the temperature at which the 
average value of the fraction of native contacts < Q > is 
equal to 0.5 [2a |. In order to determine Tf we averaged 
Q, after collapse to the native state, over MC simula- 
tions lasting at least 20 times longer than the folding 
time computed at T opt . 

Protein energetics is modeled using the Go and the 
Shakhnovich models. 



The Go model 

In the Go model [29( the energy of a conformation, 
defined by the set of bead coordinates {ri}, is given by 
the contact Hamiltonian 
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where the contact function A(f 



j), is unity if any 



beads i and j are in contact but not covalently linked, and 
is zero otherwise. The Go potential is based on the idea 
that the native fold is very well optimized energetically. 
Accordingly, it ascribes equal stabilizing energies, = 
— 1.0, to all pairs of beads i and j that form a contact in 
the native structure, and neutral energies, = 0, to all 
non-native contacts. 



The Shakhnovich model 

By contrast with the Go model, which ignores the pro- 
tein's chemical composition, the Shakhnovich model (see 
e.g., [I3|) addresses the dependence of protein folding 
dynamics on the amino acid sequence by considering in- 
teractions between the 20 different amino acids used by 
Nature in the synthesis of real proteins. Accordingly, 
the contact Hamiltonian that defines the energy of each 
conformation is given by 
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where represents an amino acid sequence, and <7j 
stands for the chemical identity of bead i. In this case 
both the native and the non-native contacts contribute 
energetically to the folding process. The interaction pa- 
rameters e are taken from the 20 x 20 Miyazawa- Jernigan 
matrix, derived from the distribution of contacts of na- 
tive proteins [3ll |. 

Two non-homologous sequences, numbered 1 and 2, 
were studied within the context of the Shakhnovich 
model. The latter were designed to fold into the na- 
tive conformation shown in Figure [1] with the method 
developed by Shakhnovich and Gutin based on ran- 
dom heteropolymer theory and simulated annealing tech- 
niques [321 ] . 

Table U summarizes some kinetic and thermodynamic 
properties of the model proteins discussed above. 



A GENERAL STRATEGY TO IDENTIFY THE 
FOLDING NUCLEUS 

We define the FN as a specific set of native contacts 
which, once formed, prompts rapid and highly probable 
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FIG. 1: The native conformation used in this study (top) and the corresponding contact map (bottom). Each square in the 
contact map represents a non-covalent native contact, i.e., a contact that is not a covalent linkage. 



Sequence 


E nat Topt T/ log 10 (t) 


Go 

1 : EPEWQLEFDNSNYAWPANYAQHLPGMYRFTVFDMQRNHTSCKLCFLFS 

2 : C IFDLEFECPAFP AP I GWLGLVSVVYLFP VRYCRLCMFNCRFKTKTRC 


-57.00 0.65 0.770 5.95 ± 0.03 
-24.34 0.29 0.305 6.84 ± 0.04 
-26.84 0.32 0.332 6.53 ± 0.04 



TABLE I: Kinetic and thermodynamic properties of the three model proteins. The folding time, t, is measured at the optimal 
folding temperature, T op t- Also shown is the folding transition temperature, T/, and the native state's energy E na t- 



folding to the native state. In what follows we render a 
methodology to investigate the existence of folding nuclei 
in the folding of 48-mer lattice polymers whose energetics 
are modeled by the Go or by the MJ potential. 

The vast majority of small (i.e. with less than 100 
amino acids), single domain proteins fold in a two- 
state manner with a relaxation rate following single- 
exponential kinetics [33| . Two-state folding is often ra- 
tionalized through a 'classical' mass-action scheme 34 1. 
Accordingly, the ensemble of conformations that make up 
the unfolded state (U) is separated from the native fold 
(N) by a free energy barrier along some appropriately de- 
fined reaction coordinate. The ensemble of conformations 
that lie on the top of the reaction barrier is the so-called 
transition state (TS). By definition, TS's conformations 
have folding probability Pfoid = 1/2 (in other words, 
TS's conformations have a probability 0.5 to fold before 
they unfold) [35| . If folding occurs via nucleation, con- 
formations that rapidly reach the native state with high 



probability Pfoid 3> 1/2 are post-transition state con- 
formations in which the FN is formed. The latter is in- 
deed a postcritical FN since its formation inevitably leads 
to the formation native state [4j. In the present study 
we are therefore interested in postcritical folding nuclei. 
An appropriate structural analysis of a significantly large 
ensemble of such conformations should therefore reveal, 
with a high degree of statistical confidence, a set of com- 
mon contacts which is the FN. To build such an ensemble 
we consider 1000 different folding events and, for each in- 
dividual event, we identify the earliest formed conforma- 
tion (EFC) that folds rapidly and with high probability, 
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fold- 



Ill order to determine the EFC for a given 



folding event conformations are sampled at times 
t s (n) = FPT - nAt, 



(3) 



where At is an appropriate sampling interval and n — 
1,2, .... More precisely, starting with n = 1, the folding 
probability, Pfoid, of the conformation collected at time 
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t s (l) is computed; this amounts to determining the frac- 
tion of folding simulations (in a set of 100 MC runs) 
which, starting from that conformation reach the na- 
tive state without passing through conformations with 
Q < Qu, i.e., the protein folds before it unfolds (we con- 
sider a protein to be unfolded if its fraction of native con- 
tacts is smaller than some cut-off Qu)- If Pfoid < P} id 
the conformation is discarded. Otherwise, if the folding 
time t is smaller than some cut-off time t max , the pro- 
cedure described above is repeated for n — 2 etc. The 
EFC for a given folding event is the conformation corre- 
sponding to the largest n which has Pfoid > Pfoid an<1 
t < t max . In the following section we discuss in some 
detail the procedure used to fix the parameters Qu, At 

and tjnax • 

Nucleation in the Go model 
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FIG. 2: Probability of finding a conformation with fraction of 
native contacts Q a function of Q. Qu is the fraction of native 
contacts below which the protein is considered to be unfolded. 
The probability of Q = 1.0 vanishes since the simulation stops 
when the protein reaches the native state. 



Determination of Qu , t max and At 

While it is trivial to identify the native state (since is 
the unique conformation with Q = 1.0) it is not straight- 
forward to decide weather a conformation belongs to the 
ensemble of unfolded conformations or is kinetically close 
(i.e., rapidly converts) to the native state. 

The fraction of native contacts Q has been exten- 
sively used in simulation studies as a reaction coordi- 
nate, i.e., as a parameter that quantifies the degree of 



folding m m [37], [38|]. In general, however, Q measures 
closeness to the native structure in energetic (or thermo- 
dynamic) terms only. It has been argued that, unless 
the energy landscape is considerably smooth, thermody- 
namic closeness does not necessarily imply kinetic prox- 
imity to the native structure [3^ ]. However, even if the 
suitability of Q as a reaction coordinate is questionable, 
very small Qs must necessarily identify unfolded confor- 
mations (i.e., that are thermodynamically and kinetically 
distant from the native fold ). 

In order to distinguish unfolded conformations from 
other conformers we have computed the probability of 
finding a conformation with a fraction of native contacts 
Q as a function of Q in a sample of 200 different folding 
events. Two peaks are apparent in the graph reported in 
Figure [2j a high-probability peak centered at Q = 0.088 
and another one, of considerably lower probability, that 
appears immediately prior to the native fold. The high 
probability peak is clearly associated with the unfolded 
states. The cut-off Qu is chosen such that more than half 
of the unfolded peak lies to the left of Q = Qu- In what 
follows we take Qu = 0.15 but note that other values 
of Qu were tested and were found to lead to the same 
results. 

The probability for the protein to be in high-Q con- 
formations is small but non negligible (Fig [5]). This hap- 
pens because the optimal folding temperature T opt , at 



which data was collected, is well below the system's fold- 
ing transition temperature Tf (Table [J). Accordingly, 
the protein may be trapped in low energy conformations 
that share a high degree of structural similarity with 
the native fold (i.e., whose fraction of native contacts 
is Q ~ 0.8). 

By definition, the formation of the FN prompts rapid 
and highly probable folding (Pfoid > Pfoid)- The cut-off 
parameter t rnax (i.e., the maximum number of MC steps 
in which the protein is required to reach the native fold) is 
therefore a particularly important step of the procedure 
proposed to identify the FN. 

A tentative sampling interval (about 2 orders of magni- 
tude smaller than the folding time for this model protein) 
was used to collect an ensemble of ~ 2000 conformations 
with Pf oid = 1 from 100 different folding events. The 
vast majority (> 90%) of such conformations were found 
to reach the native state in time t less than 1.4xl0 4 MCS 
while about 10% take a considerably longer time to fold 
(Figure©. 

Two (folding) time scales are clearly distinguished in 
this ensemble of conformations. The shorter time scale 
corresponds to conformations where the FN has the high- 
est probability of being formed, while the longer one is 
associated with folding events during which the protein is 
trapped in low energy states which, despite despite shar- 
ing a large similarity with the native fold, do not have the 
FN formed (Figure©. In order to eliminate the latter 
conformations t max was set to 1.4xl0 4 MCS. 

The efficiency of the sampling procedure may be im- 
proved by choosing the sampling interval, At, appropri- 
ately. Let FPT - FPTefc be the number of MC steps 
required to complete folding once the EFC forms at time 
FPTefc in a given folding event. We define ^efc as the 
average folding time of the EFC of 100 folding events (i.e., 
£efc is the average of FPT — FPTefc computed over 100 
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FIG. 3: Fraction of unfolded conformations as a function of 
time (in a log 10 -scale). About 90% of the conformations fold 
in 1.4xl0 4 MCS while the remaining 10% fold in times that 
are about one order of magnitude longer. 
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FIG. 4: A typical plot of Pf u as a function of n (with At = 
1000). Conformations collected at small n have a very high 
Pfoid and some of them have Pfoid = 1. 

folding events). Ideally, the sampling interval should be 
smaller than £efc, or at least of the same order of mag- 
nitude. In practice, for a tentative At, we compute £efc 
by averaging NAt in 100 folding events where N is the 
maximum value of n for each event. We fix At if the 
corresponding £efc lies between 5At or lOAt. For the 
model protein considered in this section we have found 
that i E FC ~ 6000 for At = 1000 MCS, which means 
that, on average, the EFCs are collected at a sampling 
time t s (6). 

In Figure |4] the dependence of Pfoid on n is shown for a 
single folding event. The folding probability is zero when 
n = 16 but as time approaches the FPT (i.e. for n < 
16) the protein explores a series of conformations, with 
Pfoid 7^ and reaches the native state with Pfoid = 1 
when n = 0. The conformations corresponding to n = 1 
and n = 2 have Pfoid = 1 as well and reach the native 
state in time tf < t max . Thus, the EFC for this folding 
event is the conformation which corresponds to n = 2. 



A folding nucleus determined solely by native topology 

Having fixed the parameters Qu, Ai and t max we 
ran 1000 different folding events from which an ensem- 
ble of 1000 conformations (one conformation per fold- 
ing run) were collected. The latter are all EFCs, i.e., 
the earliest conformations in folding events that collapse 
rapidly to the native state (i.e., their folding time is 
t < t max = 14000 MCS) with unit folding probability. 
The average fraction of native contacts of this ensemble 
of conformations is < Q >efc= 0.67. 

We start by labeling the 57 native contacts as in Table 

IE 

For each native contact we define the contact probabil- 
ity as the number of conformations in which the contact is 
formed normalized to the total number of conformations 
in the sample. Results reported in Figure [5] show that the 
contact probability varies considerably among the 57 na- 
tive contacts, an observation that is particularly evident 
for probabilities larger than 50%. This finding strongly 
suggests that, while the establishment of some contacts 
(e.g., 12 and 41, which are present in over 95% of the 
conformations analyzed) is an essential requirement to 
ensure rapid folding, the formation of others (e.g., 2 and 
54 which appear with probability <40%) does not appear 
to be a requisite to fast folding. The set of 9 contacts, 
involving residues 6, 8, 9, 11, 28, 31-33, 35, and 36 (Fig- 
ure[6j left), and identified by contact number in Figure[5l 
seems to be particularly relevant. Indeed, each individual 
contact is formed in more than 85% of the conformations 
analyzed, and all of the 9 contacts are simultaneously 
formed in 64% of the conformers. Moreover, on average, 
8.2 of them are present in the ensemble of conformations 
considered. 

The fact that rapid folding is associated with the for- 
mation of a set of highly probable contacts suggests that 
such a contact set is the FN. 

There is, of course, a certain degree of arbitrariness 
in the choice of the probability cut-off that is used to 
identify the highly probable contacts, and therefore the 
set of contacts identified above is a putative FN. 

Nucleation in the Shakhnovich model 

In order to investigate the importance of amino acid 
sequence in the formation of the FN, we studied the fold- 
ing of two non-homologous sequences (numbered 1 and 
2) (Table |T|. 

Determination of Qu , At and t max 

In the Go model the so-called topological frustra- 
tion [4fj| results from polymer properties of the chain 
such as connectivity 0, |35| , excluded volume effects, and 
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TABLE II: For structures that like ours are maximally compact cuboids with N = 48 residues there are 57 native contacts. 
This table displays the correspondence between the contact number and the pair of residues involved in each contact. 
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FIG. 5: The contact histogram for the Go model showing, for 
each native contact, the probability of being formed in the en- 
semble of 1000 EFC conformations that fold in time t < 1300 
MCS with unit folding probability. The nine contacts identi- 
fied by number have the highest probability (i.e. probability 
> 85%) of being formed. 



quirks of the native topology, such as lack of symme- 
try [4l|. Topological frustration is the only type of frus- 
tration in models which, like the Go model, are native 
centric. On the other hand, by taking into account the 
protein chemistry, the Shakhnovich model also exhibits 
energetic frustration. The latter typically leads to longer 
folding times and, at temperatures below the folding 
transition temperature, the chain is prone to get trapped 



in low energy states [4l|. This implies that, in contrast 



with the Go model, for which T opt is well below Tf, the 
two Shakhnovich protein sequences have optimal fold- 
ing temperatures which are close to the system's folding 
transition temperatures (Table [J). Thus, although the 
observed folding times are longer than those found for the 
Go model (Table HJ, the Shakhnovich model proteins do 
not get trapped in high-Q, low energy states. Indeed, the 
Q probability distributions are not peaked in the high-Q 



(Q ~ 0.8) region (Figure[7]) although both models exhibit 
a well defined, high-probability low-Q peak, at Q = 0.20, 
corresponding to the unfolded states. Applying the same 
criterion for the choice of cut-off Qu, one considers a 
conformation unfolded if Q < Qu = 0.25. As before, we 
have found that the results for the FN are robust with 
respect to small variations in the choice of Qu- 

In order to fix t max , a set of ~ 1200 conformations (per 
sequence), with P^i^ = 0.90, is collected from 100 dif- 
ferent folding events and the corresponding folding times 
are measured. For sequence 1, two folding time scales 
are observed (Figure [5J top) . The fraction of native con- 
tacts in the ensemble of sequence l's conformations is 
Q = 0.72 ± 0.12. Since there is a small probability for se- 
quence 1 to be in conformations with Q ~ 0.7 (Figure [7]) 
the longer time-scale may be ascribed to the population 
of these relatively high-Q conformations which, being lo- 
cal energy minima, will slow down folding. In order to 
disregard these conformations the cut-off time is set to 
tmax = 30000 MCS. By contrast, for sequence 2 the fold- 
ing times are all of the same order of magnitude (Figure 
bottom) and there is no need to use a cut-off time, 

tmax- 

The reason for taking PJgi^ — 0.9, instead of Pf^^ — 
1.0 as in the Go model, is that the latter leads, in 
the Shakhnovich model, to an ensemble of conforma- 
tions with a high average fraction of native contacts 
(< Q >^ 0.85). The latter are practically folded and 
thus are not suitable to distinguish the contacts that be- 
long to the FN from other trivial contacts. 

To improve the efficiency of the sampling procedure we 
have, also for the Shakhnovich model proteins, optimized 
the sampling intervals as described previously. We have 
found that At = 1000 MCS works well for both proteins 
yielding i EF c ~ 9500 MCS and i EF c ~ 14000 MCS for 
sequence 1 and 2 respectively, i.e., on average the EFCs 
for sequence 1 are collected at t s (10) while for sequence 
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FIG. 6: The folding nucleus for the Go model (left), for sequence 1 (center) and for sequence 2 (right) is the set of 9, 10 and 8 
contacts, respectively, colored in red. Residues whose number along the sequence is less than 12 are colored in blue and those 
whose number along the sequence is larger than 26 are colored in red. 




FIG. 7: Probability of having a conformation with fraction 
of native contacts Q for sequences 1 (top) and 2 (bottom). 
The peak at small Q is well defined for both model proteins. 
The probability curve for sequence 2 falls sharply to zero as Q 
increases, while for sequence 1 there is a small probability for 
the system to be found in conformations with 0.5 < Q < 0.7. 
In either case the protein is considered to be unfolded when 
the fraction of native contacts is smaller than Qu = 0.25. 



FIG. 8: Fraction of unfolded conformations as a function of 
time starting from conformations with Pf id > 0.90. For se- 
quence 1 (top) two time-scales, differing by one order of mag- 
nitude, may be observed. In order to select the fastest folders 
the cut-off time is fixed at t max = 30000 MCS. For sequence 
2 (bottom) there is no need to use a cut-off time since all 
folding are of the same order of magnitude. 



Folding nuclei determined by topology and protein sequence 



2 they are collected at i s (14). 



Two ensembles, each comprising 1000 EFCs, were ob- 
tained for sequences 1 and 2 using the parameters dis- 
cussed in the previous section, with < Q >efc= 0.65 
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FIG. 9: Contact histograms for sequence 1 (top) and sequence 
2 (bottom). Contacts present with the highest probability 
(> 95%) are identified by contact number. 



and < Q >efc= 0.62 for sequences 1 and 2 respectively. 
These values of < Q > are similar to that of the Go model 
and considerably lower than those obtained if Pj old = 1.0 
is used for the Shakhnovich model, allowing the distinc- 
tion of the contacts in a putative FN from other spurious 
contacts. 

The native structure of sequences 1 and 2 is the same as 
that of the Go model and the same numbering of native 
contacts is used (Table HI)) . 

From the analysis of the contact histograms we ob- 
serve that some native contacts are present with very 
high probability (> 95%) (Figure [5]). We consider the 
putative FN as the set of the most probable contacts. 

For sequence 1 the FN is thus formed by 10 native 
linkages (identified by contact number in Figure [SI top) 
involving 12 residues (namely, 2, 4, 6, 7, 27, 28, 33, 34, 
35, 40, 41, and 43)(Figure[l center). The 10 contacts 
forming the FN are simultaneously present in 82% of the 
EFC conformations analyzed and, on average, the latter 
have 9.7 of these contacts formed. It is interesting to note 
that the average stability of the contacts forming the FN 
is 62% higher than the average stability of the 57 native 
contacts of the folded protein (Table IIIip . For sequence 
2 the FN is formed by 8 native contacts (identified by 
contact number in Figure O bottom) and 10 residues 
(namely, residues 5, 6, 7, 8, 32, 35, 39, 40, 44, and 46) 



(Figure [6l right). The 8 contacts forming the FN are 
simultaneously present in 90% of the EFC conformations 
analyzed and, on average, the latter have 7.9 of these 
contacts formed. In this case the average stability of the 
FN's contacts is 53% higher than the average stability of 
the protein's native contacts (Tabic UTT|l . 

The two folding nuclei have 2 native contacts (12 and 
27) and 4 residues in common. These native contacts are 
non-local linkages between residues 6 and 35 and between 
residues 7 and 40, suggesting that the establishment of 
the corresponding long range interactions might be de- 
terminant to ensure rapid folding. 

Structurally speaking the FN of sequence 1 consists of 
two loops, one formed by residues 2, 27, 41, and 6 and 
the other by residues, 41, 6, 40, and 7 (Figure[H center). 
Each of these loops is formed by contacts located in the 
interior of the protein, while in sequence 2 a significant 
fraction of the FN's contacts are located on the fold's 
surface (Figure O right). 



Nucleation scenarios and contact stability 

The Go FN shares 22% of its contacts with sequence 
1 and 33% with sequence 2. The presence of these con- 
tacts in the folding nuclei of the Shakhnovich models is 
driven by native topology. Indeed, the average stability 
of the Shakhnovich contacts that are also present in the 
Go model is up to 25% lower than the average stability of 
the remaining contacts in the FN (Table IIII1 columns 3 
and 4) but they are formed with equally high probability 
■ 95%. 

The extremely high probability (~1) of the contact be- 
tween residues 6 and 35 (i.e. contact 12 in the contact 
histograms) in all the three model proteins is a robust 
feature of the nucleation mechanism. Another interesting 
observation regarding these residues is that they make-up 
a network of 7 native contacts in the fold (whose average 
range is 25 units of backbone distance) and about half 
of these contacts are present in each FN which suggests 
that they might be key residues in the folding process. 
We have performed exhaustive single-point mutations in 
all of the 48 residues and, in agreement with the above 
hypothesis, we have found that two mutations, one on 
residue 6, and the other on residue 35, lead to the largest 
increases in folding times (the folding time increases by 
up to 6-fold with respect to that of the wild-type se- 
quence) [42l ]. 

The average stability of the Go FN's contacts that do 
not participate in the Shakhnovich folding nuclei, of se- 
quences 1 and 2, is up to 66% lower than the protein's 57 
native contacts (Tablc Hnl columns 1 and 5). By contrast, 
the contacts that are exclusive to the Shakhnovich fold- 
ing nuclei are up to 90% more stable than the protein's 57 
native contacts (Table Hill columns 1 and 4). Moreover, 
as we have already pointed out, the Shakhnovich folding 
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-0.4 
-0.6 
-0.8 



16 23 25 28 35 41 43 46 
contact 



FIG. 10: Energy of each native contact. Half of the most 
stable contacts, identified in the figure by contact number, 
are present in sequence's 2 folding nucleus. 



nuclei are up to 81% more stable than the protein's 57 
native contacts (Table Hill columns 1 and 2). 

Clearly, by ascribing different stabilities to the pro- 
tein's native contacts, the protein sequence promotes an 
overall change of the nucleation scenario, which in the 
Go model is driven solely by the topological features of 
the native fold. To see how this happens in more detail 
we investigated the effect of contact stability in the con- 
tact histogram (i.e. in the determination of the FN) of 
sequence 2. The most stable contacts in this case are 
contacts 1, 16, 23, 25, 28, 35, 41, 43, 46, 56 (Figure HUJ) 
and, not surprisingly, half of them belong to the FN (Fig- 
ure[H]). It is interesting to note that, by being particularly 
stable, some contacts may indirectly promote an increase 
in the probability of occurrence of other less stable con- 
tacts. This feature is well illustrated by residue 47 and 
the three contacts it establishes in the fold. The latter 
appear with considerably high probabilities in the con- 
tact histogram. The probabilities of contacts 23 and 56 
(which are considerably lower in the Go model) may be 
ascribed to their very high stabilities. However, contact 
2 is a neutral one and, in spite of its relative low stability, 
its probability is higher when compared with other stable 
contacts in the protein. This presumably happens be- 
cause the very high stability of contacts 23 and 56 forces 
residue 47 to be in its native environment (i.e. to have 
all of its native contacts formed simultaneously) which 
naturally increases the probability with which contact 2 
is formed. 

Stability is indeed a considerably determinant factor 
for the Shakhnovich FN, but is not the whole story. The 
presence of Go contacts in the nucleus, is not energeti- 
cally favorable (Table QUI columns 2 and 3), but is very 
relevant from a functional point of view as discussed in 
the next section. 



The 'topological' role of the folding nucleus 

Despite clear differences, which are driven by contact 
stability, the three folding nuclei are nonetheless topolog- 
ically similar. The residues that participate in the set of 
native contacts forming the folding nuclei split into two 
groups located in different regions of the protein chain. 
Indeed, in all cases there is a group of 4 residues located 
in one region of the chain that comprises residues 2 to 11 
and there is another group of 6 (or 8) residues located in 
a distant part of the chain that extends between residue 
27 and residue 46. This is illustrated in Figure [5] where 
the residues whose number along the sequence is less than 
12 are colored in blue while those whose number along 
the sequence is larger than 26 are colored in red. It then 
follows that more than two thirds of the contacts that 
make up the folding nuclei are non-local contacts whose 
range lies between 18 to 30 units of backbone separation. 
In the three protein models the FN performs the same 
'topological' role, that of linking residues located in two 
distant parts of the protein chain. 



CONCLUSIONS 

In the present work we have proposed and discussed 
in detail a methodology to the identify the folding nu- 
cleus (i.e. a specific subset of native contacts which, once 
formed, prompts very rapid and highly probable folding) 
in small lattice proteins and applied it to investigate the 
nucleation mechanism of three model proteins with chain 
length N=48. We have found that a folding nucleus (FN) 
which is solely driven by the native fold's topological fea- 
tures (as it happens in the Go model) is not globally 
robust with regard to protein sequence. The latter dis- 
tinguishes native contacts, based on the stability of their 
interaction energies, and the nucleation pattern is biased 
towards the most stable contacts. In other words: in 
a (more realistic) lattice model, like a sequence-specific 
one, the FN is, to some extent, formed by the most stable 
contacts, and the presence of other less stable contacts 
in the FN is uniquely determined by the fold's topology. 
However, we have found that, independently of protein 
sequence, the residues forming the three folding nuclei 
are distributed along the protein chain in a similar and 
well defined manner. Accordingly, the nucleation mech- 
anism comprises the coalescence of two distinct and dis- 
tant parts of the protein chain through the establishment 
of the long range interactions corresponding to the non- 
local contacts forming the FN. Therefore we conclude 
that the fold's topology determines, to a large extent, 
the overall position of the FN in the protein chain. How- 
ever, as shown by Tiana et al. [HI], sequences as dissimi- 
lar as ours may have a different set of key residues (e.g. 
residues 6 and 35 in our models) in the FN, which may 
lead to the latter being topologically distinct. 
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Mean energy per contact 




protein 


Sfn 


Sfn A Gofn 


Sfn A (~ Gofn) 


(~ Sfn) A Gofn 


Sequence 1 


-0.427 


-0.691 


-0.579 


-0.719 


-0.424 


Sequence 2 


-0.471 


-0.854 


-0.783 


-0.896 


-0.283 



TABLE III: Mean energy per contact in different contact sets. In the the first column the average is computed over the 
protein's 57 native contacts. Sfn stands for the Shakhnovich FN and Gofn stands for the Go FN. Accordingly, the second 
column displays the contact's mean energy in Sfn; the contact's mean energy in the set of contacts that are common to the 
Shakhnovich nuclei and Gofn is shown in the third column. The fourth column refers to the set of contacts that are in Sfn 
but not in Gofn and finally, in the fifth column, one considers the contacts that are in Gofn but not in Sfn. 



A particularly interesting finding of this work regards 
the existence of 2 residues which, in the three model sys- 
tems, are involved in about 30% of the contacts forming 
the FN and appear to be determinant in ensuring fast 
folding. We speculate that the network of native con- 
tacts formed by these residues is sufficient to determine 
the overall fold of the protein in a way that is similar 
to that found by Vendruscolo et al. [44[ for a 98-residue 
protein model off-lattice. 

Previous simulation efforts on lattice models have fo- 
cused on smaller (namely N=28 Q and N=36 as well 
as on proteins with the same chain length 45j . We have 
found that the size of the FN is similar to the size of the 
nuclei identified by Shakhnovich and collaborators (con- 
taining between 8 and 11 native contacts) which suggests 
that, at least for small proteins, the size of the FN does 
not depend on the size of the chain. This could provide 
an explanation for the small correlation between chain 
length and folding rates found in real proteins 

Generalizations of the methodology described here, 
may be useful to investigate the folding pathways of 
model proteins. A very preliminary analysis of our data 
indicates that there is a higher degree of structural sim- 
ilarity among the EFCs of the Shakhnovich model than 
among those of the Go model. Indeed, we have deter- 
mined how many different native contacts exist between 
each pair of conformations in the three ensembles that 
were used to identify the FNs (i.e. in the three ensem- 
bles of EFCs) and computed its mean value over the total 
number of possible pairs. We have found that, on aver- 
age, two EFCs in the Go model differ by 11.3 native con- 
tacts. Sequences 1 and 2, on the other hand, differ by 9.7 
and 7.2 native contacts respectively. We speculate that 
the higher structural similarity between conformations 
in the Shakhnovich model may be related to a smaller 
number of rapid folding pathways. However, a definite 
conclusion requires further quantitative analysis. 
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